Intermediate Python

Lecturer: Hugo Bowne-Anderson


It is recommended that you take the course Introduction to Python prior to this course.

1 Course Description

Learning Python is crucial for any aspiring data science practitioner. Learn to visualize real data with matplotlib’s functions and get acquainted with data structures such as the dictionary and pandas DataFrame. This four-hour intermediate course will help you to build on your existing Python skills and explore new Python applications and functions that expand your repertoire and help you work more efficiently.

You’ll discover how dictionaries offer an alternative to Python lists, and why the pandas DataFrame is the most popular way of working with tabular data. In the second chapter of this course, you’ll find out how you can create and manipulate datasets, and how to access them using these structures. Hands-on practice throughout the course will build your confidence in each area.

As you progress, you’ll look at logic, control flow, filtering and loops. These constructs control decision-making in Python programs and help you to perform more operations with your data, including repeated statements. You’ll finish the course by applying all of your new skills, using hacker statistics to calculate your chances of winning a bet.

Once you’ve completed all of the chapters, you’ll be ready to apply your new skills in your job, new career, or personal project, and be prepared to move onto more advanced Python learning.

Course materials can be found here.

2 Matplotlib

Learn how to build line plots, scatter plots and histograms with matplotlib, and how to customize them with labels, ticks, sizes, colors and other finishing touches.

2.1 Lecture: Basic Plots with Matplotlib

2.2 Line Plot

With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot. A general recipe is given here.

import matplotlib.pyplot as plt
plt.plot(x,y)
plt.show()

In the video, you already saw how much the world population has grown over the past years. Will it continue to do so? The World Bank has estimates of the world population for the years 1950 up to 2100. The years are loaded in your workspace as a list called year, and the corresponding populations as a list called pop. The data can be found here.

# create year and pop
import pandas as pd
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/year_pop.csv'
df = pd.read_csv(url)
year = list(df['year'])
pop = list(df['pop'])

# Print the last item from year and pop
print(year[-1], '\n', pop[-1])
## 2100 
##  10.85
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Make a line plot: year on the x-axis, pop on the y-axis
plt.plot(year, pop)

# Display the plot with plt.show()
plt.show()

Great! Now that you’ve built your first line plot, let’s start working on the data that professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

  • life_exp which contains the life expectancy for each country and
  • gdp_cap, which contains the GDP per capita (i.e. per person) for each country expressed in US Dollars. The data can be found here.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population and you get the GDP per capita.

# create gdp_cap and life_exp
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/gdp_cap_life_exp.csv'
df = pd.read_csv(url)
gdp_cap = list(df.gdp_cap)
life_exp = list(df.life_exp)

# Print the last item of gdp_cap and life_exp
print(gdp_cap[-1], '\n', life_exp[-1])
## 469.70929810000007 
##  43.487
# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
plt.plot(gdp_cap, life_exp)

# Display the plot
plt.show()

Well done, but this doesn’t look right. Let’s build a plot that makes more sense.

2.3 Scatter Plot

When you have a time scale along the horizontal axis, the line plot is your friend. But in many other cases, when you’re trying to assess if there’s a correlation between two variables, for example, the scatter plot is the better choice. Below is an example of how to build a scatter plot.

plt.scatter(x, y)
plt.show()

Let’s continue with the gdp_cap versus life_exp plot, the GDP and life expectancy data for different countries in 2007. Maybe a scatter plot will be a better alternative?

# Change the line plot below to a scatter plot
plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
plt.xscale('log')

# Show plot
plt.show()

That looks much better! You see that the higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation.

Do you think there’s a relationship between population and life expectancy of a country? The list life_exp from the previous exercise is already available. In addition, pop is now available as well, listing the corresponding populations for the countries in 2007. The populations are in millions of people. The data can be found here.

# create pop
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/pop.csv'
df = pd.read_csv(url)
pop = list(df['pop'])

# Build Scatter plot
plt.scatter(pop, life_exp)

# Show plot
plt.show()

Nice! There’s no clear relationship between population and life expectancy, which makes perfect sense.

2.4 Lecture: Histogram

2.5 Build a Histogram

To see how life expectancy in different countries is distributed, let’s create a histogram of life_exp.

# Create histogram of life_exp data
plt.hist(life_exp)

# Display histogram
plt.show()

Great job! In the above plot, you didn’t specify the number of bins. By default, matplotlib sets the number of bins to 10 in that case. The number of bins is pretty important. Too few bins will oversimplify reality and won’t show you the details. Too many bins will over-complicate reality and won’t show the bigger picture.
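To see the effect of the number of bins without drawing anything, the sketch below uses np.histogram, which returns the bin counts and bin edges and also defaults to 10 bins. The values array is made up for illustration; it is not the life_exp data.

```python
import numpy as np

# Hypothetical sample values (not the course's life_exp list)
values = np.array([43.5, 50.1, 55.3, 59.7, 62.0, 64.8, 70.2, 71.9,
                   72.4, 74.1, 75.6, 76.3, 78.0, 79.4, 80.7, 82.6])

# Default: 10 equal-width bins between the minimum and the maximum
counts_default, edges_default = np.histogram(values)

# Fewer bins give a coarser picture
counts_5, edges_5 = np.histogram(values, bins=5)

# More bins give a finer picture; many bins end up nearly empty
counts_20, edges_20 = np.histogram(values, bins=20)

print(len(counts_default), len(counts_5), len(counts_20))  # 10 5 20
print(counts_5)
print(counts_20)
```

Whatever the number of bins, every value lands in exactly one bin, so the counts always add up to the number of values.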

To control the number of bins to divide your data in, you can set the bins argument.

# Build histogram with 5 bins
plt.hist(life_exp, bins = 5)
plt.show()

# Build histogram with 20 bins
plt.hist(life_exp, bins = 20)
plt.show()

2.6 Choose the right Plot

In the video, you saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison.

Let’s do a similar comparison. life_exp contains life expectancy data for different countries in 2007. You also have access to a second list now, life_exp1950, containing similar data for 1950. The data can be found here.

# create life_exp1950
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/life_exp1950.csv'
df = pd.read_csv(url)
life_exp1950 = list(df['life_exp1950'])

# Histogram of life_exp, 15 bins
plt.hist(life_exp, bins = 15)
plt.show()

# Histogram of life_exp1950, 15 bins
plt.hist(life_exp1950, bins = 15)
plt.show()

2.7 Lecture: Customization

2.8 Labels

It’s time to customize your own plot. This is the fun part: you will see your plot come to life!

You’re going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis.

# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log') 

# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'

# Add axis labels
plt.xlabel(xlab)
plt.ylabel(ylab)

# Add title
plt.title(title)

# After customizing, display the plot
plt.show()

This looks much better already!

2.9 Ticks

In the video, Hugo has demonstrated how you could control the y-ticks by specifying two arguments:

plt.yticks([0, 1, 2], ['one', 'two', 'three'])

In this example, the ticks corresponding to the numbers 0, 1 and 2 will be replaced by one, two and three, respectively.

Let’s do a similar thing for the x-axis of your world development chart, with the xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k.

# Scatter plot
plt.scatter(gdp_cap, life_exp)

# Previous customizations
plt.xscale('log') 
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)

# Definition of tick_val and tick_lab
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']

# Adapt the ticks on the x-axis
plt.xticks(tick_val, tick_lab);

# After customizing, display the plot
plt.show()

Great! Your plot is shaping up nicely!

2.10 Sizes

Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let’s change this. Wouldn’t it be nice if the size of the dots corresponded to the population?

# Import numpy as np
import numpy as np

# Store pop as a numpy array: np_pop
np_pop = np.array(pop)

# Double np_pop
np_pop = np_pop * 2

# Set s argument to np_pop
plt.scatter(gdp_cap, life_exp, s = np_pop)

# Previous customizations
plt.xscale('log') 
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
plt.xticks(tick_val, tick_lab);

# Display the plot
plt.show()

2.11 Colors

The next step is making the plot more colorful! To do this, a list col has been given for you. It’s a list with a color for each corresponding country, depending on the continent the country is part of. The data can be found here.

# create col
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/col.csv'
df = pd.read_csv(url)
col = list(df['col'])

# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp,
            s = np.array(pop) * 2, c = col, alpha = 0.8)

# Previous customizations
plt.xscale('log') 
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
plt.xticks(tick_val, tick_lab);

# Show the plot
plt.show()

Nice! This is looking more and more like Hans Rosling’s plot!

2.12 Additional Customizations

# Scatter plot
plt.scatter(x = gdp_cap, y = life_exp,
            s = np.array(pop) * 2, c = col, alpha = 0.8)

# Previous customizations
plt.xscale('log') 
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
plt.xticks(tick_val, tick_lab);

# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')

# Add grid() call
plt.grid()

# Show the plot
plt.show()

3 Dictionaries & Pandas

Learn about the dictionary, an alternative to the Python list, and the pandas DataFrame, the de facto standard to work with tabular data in Python. You will get hands-on practice with creating and manipulating datasets, and you’ll learn how to access the information you need from these data structures.

3.1 Lecture: Dictionaries, Part 1

3.2 Motivation for Dictionaries

To see why dictionaries are useful, have a look at the two lists defined below. countries contains the names of some European countries. capitals lists the names of the corresponding capitals.

# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])
## berlin

As Hugo already told you: this works, but it’s not very convenient.

3.3 Create Dictionaries

The countries and capitals lists are again available below. Let’s convert this data to a dictionary where the country names are the keys and the capitals are the corresponding values. As a refresher, here is a recipe for creating a dictionary:

my_dict = {
   'key1':'value1',
   'key2':'value2',
}

# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# From string in countries and capitals, create dictionary europe
europe = {'spain':'madrid', 'france':'paris',
          'germany':'berlin', 'norway':'oslo'}

# Print europe
print(europe)
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}

Great! Now that you’ve built your first dictionaries, let’s get serious!

3.4 Access Dictionaries

If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for France from europe you can use:

europe['france']

Here, ‘france’ is the key and ‘paris’ is the value that is returned.

# Print out the keys in europe
print(europe.keys())
## dict_keys(['spain', 'france', 'germany', 'norway'])
# Print out value that belongs to key 'norway'
print(europe['norway'])
## oslo

Good job, now you’re warmed up for some more.

3.5 Lecture: Dictionaries, Part 2

3.6 Dictionary Manipulation

If you know how to access a dictionary, you can also assign a new value to it. To add a new key-value pair to europe you can use something like this:

europe['iceland'] = 'reykjavik'

# Add italy to europe
europe['italy'] = 'rome'

# Print out italy in europe
print('italy' in europe)
## True
# Add poland to europe
europe['poland'] = 'warsaw'

# Print europe
print(europe)
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}

Well done! Europe is growing by the minute! Notice that the printout lists the keys in the order in which they were added. Since Python 3.7, dictionaries preserve insertion order. Earlier Python versions gave no such guarantee, so code that has to run on them should not rely on the order of the keys.
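A minimal sketch of how key order behaves in current Python versions (3.7 and later). The dictionary other is introduced here only for the comparison.

```python
# Keys come back in the order in which they were inserted
europe = {'spain': 'madrid', 'france': 'paris'}
europe['italy'] = 'rome'
europe['poland'] = 'warsaw'
print(list(europe))
# ['spain', 'france', 'italy', 'poland']

# Updating an existing key keeps its original position
europe['spain'] = 'MADRID'
print(list(europe))
# ['spain', 'france', 'italy', 'poland']

# Order plays no role when two dictionaries are compared
other = {'poland': 'warsaw', 'italy': 'rome',
         'france': 'paris', 'spain': 'MADRID'}
print(europe == other)
# True
```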

Somebody thought it would be funny to mess with your accurately generated dictionary. An adapted version of the europe dictionary is available below. Let’s clean up! Do not do this by adapting the definition of europe, but by adding Python commands to update and remove key:value pairs.

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna'}

# Update capital of germany
europe['germany'] = 'berlin'

# Remove australia
del europe['australia']

# Print europe
print(europe)
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}

Great job! That’s much better!

3.7 Dictionariception

Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries.

As an example, have a look at the script below where another version of europe - the dictionary you’ve been working with all along - is coded. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.

It’s perfectly possible to chain square brackets to select elements. To fetch the population for Spain from europe, for example, you need:

europe['spain']['population']

# Dictionary of dictionaries
europe = {'spain': {'capital':'madrid', 'population':46.77},
          'france': {'capital':'paris', 'population':66.03},
          'germany': {'capital':'berlin', 'population':80.62},
          'norway': {'capital':'oslo', 'population':5.084}
          }


# Print out the capital of France
europe['france']['capital']
## 'paris'
# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)
## {'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}

Great! It’s time to learn about a new data structure!

3.8 Lecture: Pandas, Part 1

3.9 Dictionary to DataFrame

pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The DataFrame is one of Pandas’ most important data structures. It’s basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:

  • names, containing the country names for which data is available.
  • dr, a list with booleans that tells whether people drive left or right in the corresponding country.
  • cpc, the number of motor vehicles per 1000 people in the corresponding country.

Each dictionary key is a column label and each value is a list which contains the column elements.

# Pre-defined lists
names = ['United States', 'Australia', 'Japan',
         'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)
##          country  drives_right  cars_per_cap
## 0  United States          True           809
## 1      Australia         False           731
## 2          Japan         False           588
## 3          India         False            18
## 4         Russia          True           200
## 5        Morocco          True            70
## 6          Egypt          True            45

Good job! Notice that the columns of cars can be of different types. This was not possible with 2D NumPy arrays!
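The difference with NumPy can be made visible with a small sketch. It reuses the first three entries of the lists above; arr and cars_small are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

names = ['United States', 'Australia', 'Japan']
dr = [True, False, False]
cpc = [809, 731, 588]

# A 2D NumPy array has a single dtype, so every entry is converted to text
arr = np.array([names, dr, cpc])
print(arr.dtype.kind)   # U, i.e. unicode strings

# A DataFrame keeps a separate dtype for each column
cars_small = pd.DataFrame({'country': names,
                           'drives_right': dr,
                           'cars_per_cap': cpc})
print(cars_small.dtypes)
```

In the array, the booleans and numbers can no longer be used for comparisons or arithmetic without converting them back, while the DataFrame columns keep their boolean and integer types.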

Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6? To solve this, a list row_labels has been created. You can use it to specify the row labels of the cars DataFrame. You do this by setting the index attribute of cars, that you can access as cars.index.

# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)
##            country  drives_right  cars_per_cap
## US   United States          True           809
## AUS      Australia         False           731
## JPN          Japan         False           588
## IN           India         False            18
## RU          Russia          True           200
## MOR        Morocco          True            70
## EG           Egypt          True            45

Nice! That looks much better already!

3.10 CSV to DataFrame

Putting data in a dictionary and then building a DataFrame works, but it’s not very efficient. What if you’re dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for “comma-separated values”.

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

Let’s explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file, named cars.csv. The data can be found here.

# Import the cars.csv data: cars
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intermediate%20Python/cars.csv'
cars = pd.read_csv(url)

# Print out cars
print(cars)
##   Unnamed: 0  cars_per_cap        country  drives_right
## 0         US           809  United States          True
## 1        AUS           731      Australia         False
## 2        JPN           588          Japan         False
## 3         IN            18          India         False
## 4         RU           200         Russia          True
## 5        MOR            70        Morocco          True
## 6         EG            45          Egypt          True

Nice job! Your read_csv() call to import the CSV data didn’t generate an error, but the output is not exactly what we wanted: the row labels were imported as another column without a name.

Remember index_col, an argument of read_csv(), that you can use to specify which column in the CSV file should be used as a row label? Well, that’s exactly what you need here!

# Fix import by including index_col
cars = pd.read_csv(url, index_col = 0)

# Print out cars
print(cars)
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JPN           588          Japan         False
## IN             18          India         False
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True

That’s much better!
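The effect of index_col can also be reproduced without downloading anything. The sketch below reads a few lines of CSV text from memory; csv_text is a hypothetical stand-in for cars.csv and holds only three of its rows.

```python
import io
import pandas as pd

# Hypothetical CSV text; note that the first column has no header
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

# Without index_col, the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))
# ['Unnamed: 0', 'cars_per_cap', 'country', 'drives_right']

# With index_col=0, the first column is used as the row labels
cars_small = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(cars_small.index))
# ['US', 'AUS', 'JPN']
```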

3.11 Lecture: Pandas, Part 2

3.12 Square Brackets

In the video, you saw that you can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets. To select only the cars_per_cap column from cars, you can use:

cars['cars_per_cap']
cars[['cars_per_cap']]

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.
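A quick way to check which of the two you got is to look at the type and shape of the result. The sketch below builds a two-row stand-in called cars_small, so it runs on its own.

```python
import pandas as pd

# Two-row stand-in for the cars DataFrame
cars_small = pd.DataFrame(
    {'country': ['United States', 'Japan'], 'cars_per_cap': [809, 588]},
    index=['US', 'JPN'])

single = cars_small['cars_per_cap']     # single brackets
double = cars_small[['cars_per_cap']]   # double brackets

print(type(single).__name__)   # Series
print(type(double).__name__)   # DataFrame

# A Series is one-dimensional, a DataFrame is two-dimensional
print(single.shape)   # (2,)
print(double.shape)   # (2, 1)
```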

# Print out country column as Pandas Series
print(cars['country'])
## US     United States
## AUS        Australia
## JPN            Japan
## IN             India
## RU            Russia
## MOR          Morocco
## EG             Egypt
## Name: country, dtype: object
# Print out country column as Pandas DataFrame
print(cars[['country']])
##            country
## US   United States
## AUS      Australia
## JPN          Japan
## IN           India
## RU          Russia
## MOR        Morocco
## EG           Egypt
# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])
##            country  drives_right
## US   United States          True
## AUS      Australia         False
## JPN          Japan         False
## IN           India         False
## RU          Russia          True
## MOR        Morocco          True
## EG           Egypt          True

Nice! Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

cars[0:5]

The result is another DataFrame containing only the rows you specified. Pay attention: You can only select rows using square brackets if you specify a slice, like 0:4. Also, you’re using the integer indexes of the rows here, not the row labels!

# Print out first 3 observations
print(cars[:3])
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JPN           588          Japan         False
# Print out fourth, fifth and sixth observation
print(cars[3:6])
##      cars_per_cap  country  drives_right
## IN             18    India         False
## RU            200   Russia          True
## MOR            70  Morocco          True

You can get interesting information, but using square brackets to do indexing is rather limited. Experiment with more advanced techniques in the following exercises.

3.13 loc and iloc

With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.

Try out the following commands to experiment with loc and iloc to select observations. Each pair of commands here gives the same result.

cars.loc['RU']
cars.iloc[4]

cars.loc[['RU']]
cars.iloc[[4]]

cars.loc[['RU', 'AUS']]
cars.iloc[[4, 1]]

# Print out observation for Japan
print(cars.loc['JPN'])
## cars_per_cap      588
## country         Japan
## drives_right    False
## Name: JPN, dtype: object
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
##      cars_per_cap    country  drives_right
## AUS           731  Australia         False
## EG             45      Egypt          True

loc and iloc also allow you to select both rows and columns from a DataFrame. To experiment, try out the following commands. Again, paired commands produce the same result.

cars.loc['IN', 'cars_per_cap']
cars.iloc[3, 0]

cars.loc[['IN', 'RU'], 'cars_per_cap']
cars.iloc[[3, 4], 0]

cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]
cars.iloc[[3, 4], [0, 1]]

# Print out drives_right value of Morocco
print(cars.loc['MOR', 'drives_right'])
## True
# Print sub-DataFrame
print(cars.loc[['RU', 'MOR'], ['country', 'drives_right']])
##      country  drives_right
## RU    Russia          True
## MOR  Morocco          True

It’s also possible to select only columns with loc and iloc. In both cases, you simply put a slice going from beginning to end in front of the comma:

cars.loc[:, 'country']
cars.iloc[:, 1]

cars.loc[:, ['country','drives_right']]
cars.iloc[:, [1, 2]]

# Print out drives_right column as Series
print(cars.loc[:, 'drives_right'])
## US      True
## AUS    False
## JPN    False
## IN     False
## RU      True
## MOR     True
## EG      True
## Name: drives_right, dtype: bool
# Print out drives_right column as DataFrame
print(cars.loc[:, ['drives_right']])
##      drives_right
## US           True
## AUS         False
## JPN         False
## IN          False
## RU           True
## MOR          True
## EG           True
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ['cars_per_cap', 'drives_right']])
##      cars_per_cap  drives_right
## US            809          True
## AUS           731         False
## JPN           588         False
## IN             18         False
## RU            200          True
## MOR            70          True
## EG             45          True

What a drill on indexing and selecting data from Pandas DataFrames! You’ve done great! It’s time to head over to Chapter 3 to learn all about logic, control flow, and filtering!

4 Logic, Control Flow & Filtering

Boolean logic is the foundation of decision-making in Python programs. Learn about different comparison operators, how to combine them with Boolean operators, and how to use the Boolean outcomes in control structures. You’ll also learn to filter data in pandas DataFrames using logic.

4.1 Lecture: Comparison Operators

4.2 Equality

To check if two Python values, or variables, are equal you can use ==. To check for inequality, you need !=. As a refresher, have a look at the following examples that all result in True.

2 == (1 + 1)
"intermediate" != "python"
True != False
"Python" != "python"

When you write these comparisons in a script, you will need to wrap a print() function around them to see the output.

# Comparison of booleans
print(True == False)
## False
# Comparison of integers
print(-5 * 15 != 75)
## True
# Comparison of strings
print('pyscript' == 'PyScript')
## False
# Compare a boolean with an integer
print(True == 1)
## True

The last comparison worked fine because actually, a boolean is a special kind of integer: True corresponds to 1, False corresponds to 0.
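The relationship between booleans and integers can be checked directly. The list flags is made up for this sketch.

```python
# bool is a subclass of int
print(isinstance(True, int))   # True
print(issubclass(bool, int))   # True

# Booleans take part in arithmetic as 1 and 0
print(True + True)             # 2
print(False * 10)              # 0

# This makes counting True values easy
flags = [True, False, True, True]
print(sum(flags))              # 3
```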

4.3 Greater and Less than

In the video, Hugo also talked about the less than and greater than signs, < and >, in Python. You can combine them with an equals sign: <= and >=. Pay attention: <= is valid syntax, but =< is not.

All Python expressions in the following code chunk evaluate to True:

3 < 4
3 <= 4
"alpha" <= "beta"

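Remember that for string comparison, Python determines the relationship based on alphabetical order. More precisely, strings are compared character by character using the characters’ Unicode code points, so uppercase ASCII letters sort before lowercase ones.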
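A few comparisons show where string ordering matches everyday alphabetical order and where it does not. The example strings are chosen only for illustration.

```python
# Plain lowercase words behave as expected
print("alpha" <= "beta")    # True

# Uppercase letters have smaller code points than lowercase letters
print(ord("Z"), ord("a"))   # 90 97
print("Zebra" < "apple")    # True

# Digits inside strings are compared as characters, not as numbers
print("10" < "9")           # True

# Sorting uses the same rules
print(sorted(["beta", "Alpha", "alpha"]))
# ['Alpha', 'alpha', 'beta']
```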

# Comparison of integers
x = -3 * 6
print( x >= -10)
## False
# Comparison of strings
y = "test"
print('test' <= y)
## True
# Comparison of booleans
print(True > False)
## True

4.4 Compare Arrays

Out of the box, you can also use comparison operators with NumPy arrays.

Remember areas, the list of area measurements for different rooms in your house from Introduction to Python? This time there are two NumPy arrays: my_house and your_house. They both contain the areas for the kitchen, living room, bedroom and bathroom in the same order, so you can compare them.

# Create arrays
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than or equal to 18
print(my_house >= 18)
## [ True  True False False]
# my_house less than your_house
print(my_house < your_house)
## [False  True  True False]

4.5 Lecture: Boolean Operators

4.6 and, or, not

A boolean is either 1 or 0, True or False. With boolean operators such as and, or and not, you can combine these booleans to perform more advanced queries on your data.

# Define variables
my_kitchen = 18.0
your_kitchen = 14.0

# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen < 18)
## False
# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen < 14 or my_kitchen > 17)
## True
# Double my_kitchen smaller than triple your_kitchen?
print(2 * my_kitchen < 3 * your_kitchen)
## True

4.7 Boolean Operators with NumPy

Before, the comparison operators like < and >= worked with NumPy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not.

To use these operators with NumPy, you will need np.logical_and(), np.logical_or() and np.logical_not(). Here’s an example on the my_house and your_house arrays from before to give you an idea:

np.logical_and(my_house > 13, your_house < 15)

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))
## [False  True False  True]
# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))
## [False False False  True]
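To see why the plain and keyword is not enough, the sketch below tries it on the two arrays and catches the resulting error. It also shows the element-wise & operator, which is not covered in the text above but gives the same result as np.logical_and() on boolean arrays.

```python
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# 'and' needs a single True or False, but each side is a whole array
try:
    result = my_house > 13 and your_house < 15
except ValueError as err:
    print("ValueError:", err)

# np.logical_and() combines the two boolean arrays element by element
both = np.logical_and(my_house > 13, your_house < 15)
print(both)
# [ True False False False]

# The & operator does the same; the parentheses are required
same = (my_house > 13) & (your_house < 15)
print(np.array_equal(both, same))
# True
```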

4.8 Lecture: if, elif, else

4.9 if

It’s time to take a closer look around in your house.

# Define variables
room = "kit"
area = 14.0

# if statement for room
if room == "kit" :
    print("looking around in the kitchen.")
## looking around in the kitchen.

# if statement for area
if area > 15:
    print("big place!")

big place! wasn’t printed, because area > 15 is not True. Experiment with other values of room and area to see how the printouts change.

4.10 Add else

# if-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
else :
    print("looking around elsewhere.")
## looking around in the kitchen.
# if-else construct for area
if area > 15 :
    print("big place!")
else:
    print("pretty small.")
## pretty small.

4.11 Customizing Further: elif

It’s also possible to have a look around in the bedroom.

# Define variables
room = "bed"
area = 14.0

# if-elif-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else :
    print("looking around elsewhere.")
## looking around in the bedroom.
# if-elif-else construct for area
if area > 15 :
    print("big place!")
elif area > 10:
    print("medium size, nice!")
else :
    print("pretty small.")
## medium size, nice!
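An if-elif-else chain checks its conditions from top to bottom and runs only the first branch whose condition is True. The sketch below wraps the area check in a hypothetical helper function, describe_area, to try several values at once.

```python
def describe_area(area):
    """Return a label for a room area (hypothetical helper, same thresholds as above)."""
    if area > 15:
        return "big place!"
    elif area > 10:
        # Only reached when area > 15 was False, so 10 < area <= 15 here
        return "medium size, nice!"
    else:
        return "pretty small."

for area in [20.0, 14.0, 9.5]:
    print(area, describe_area(area))
# 20.0 big place!
# 14.0 medium size, nice!
# 9.5 pretty small.
```

Note that 20.0 is also greater than 10, but the elif branch is never evaluated for it because the first condition already matched.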

4.12 Lecture: Filtering Pandas DataFrame

4.13 Driving Right

Remember that cars dataset, containing the cars per 1000 people (cars_per_cap) and whether people drive right (drives_right) for different countries (country)?

In the video, you saw a step-by-step approach to filter observations from a DataFrame based on boolean arrays. Let’s start simple and try to find all observations in cars where drives_right is True.

drives_right is a boolean column, so you’ll have to extract it as a Series and then use this boolean Series to select observations from cars.

# Extract drives_right column as Series: dr
dr = cars.loc[:, 'drives_right']

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True

The code above worked fine, but it created an unnecessary intermediate variable, dr. You can achieve the same result without it.

# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True

cars contains 7 rows or observations, sel contains 4; so in the majority of the countries in your dataset, people drive on the right side of the road.
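
The row counts mentioned above can be checked directly. This section does not show how cars is loaded, so the sketch below assumes a DataFrame rebuilt by hand from the rows printed in this chapter.

```python
import pandas as pd

# Assumption: cars rebuilt by hand from the rows printed in this chapter
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# A boolean Series inside square brackets keeps only the rows marked True
sel = cars[cars["drives_right"]]

# Summing a boolean Series counts its True values
print(len(cars), len(sel), cars["drives_right"].sum())
```

Summing the boolean column is a quick way to count matches without building the filtered DataFrame first.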

4.14 Cars per Capita

Let’s stick to the cars data some more. This time you want to find out which countries have a high cars per capita figure. In other words, in which countries do many people have a car, or maybe multiple cars?

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars.loc[:, 'cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JPN           588          Japan         False

The output shows that the US, Australia and Japan have a cars_per_cap of over 500.

Remember np.logical_and(), np.logical_or() and np.logical_not(), the NumPy variants of the and, or and not operators? You can also use them on Pandas Series to do more advanced filtering operations.

Take this example that selects the observations that have a cars_per_cap between 10 and 80. Try out these lines of code step by step to see what’s happening.

cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 10, cpc < 80)
medium = cars[between]

# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc >= 100, cpc <= 500)
medium = cars[between]

# Print medium
print(medium)
##     cars_per_cap country  drives_right
## RU           200  Russia          True
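
The same range filter can be written in more than one way. The sketch below is an illustration rather than the course's solution: it compares np.logical_and() with the & operator and with the Series method between(). The cars_per_cap values are copied from the rows printed in this chapter.

```python
import numpy as np
import pandas as pd

# Assumption: only the cars_per_cap column is rebuilt, from the printed rows
cars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45]},
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)
cpc = cars["cars_per_cap"]

# Three ways to express 100 <= cars_per_cap <= 500
with_numpy = cars[np.logical_and(cpc >= 100, cpc <= 500)]
with_operator = cars[(cpc >= 100) & (cpc <= 500)]  # parentheses are required
with_between = cars[cpc.between(100, 500)]         # inclusive on both ends by default

print(with_numpy.index.tolist())
```

The parentheses around each comparison matter with &, because & binds more tightly than >= and <=.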

5 Loops

There are several techniques you can use to repeatedly execute Python code. While loops are like repeated if statements, and the for loop iterates over all kinds of data structures. Learn all about them in this chapter.

5.1 Lecture: while loop

5.2 Basic while loop

Below you can find the example from the video where the error variable, initially equal to 50.0, is divided by 4 and printed out on every run:

error = 50.0
while error > 1 :
    error = error / 4
    print(error)

This example will come in handy, because it’s time to build a while loop yourself! We’re going to code a while loop that implements a very basic control system for an inverted pendulum. If there’s an offset from standing perfectly straight, the while loop will incrementally fix this offset.

Note that if your while loop takes too long to run, you might have made a mistake. In particular, remember to indent the contents of the loop using four spaces or auto-indentation!

offset = 8
while offset != 0:
  print('correcting...')
  offset = offset - 1
  print(offset)
## correcting...
## 7
## correcting...
## 6
## correcting...
## 5
## correcting...
## 4
## correcting...
## 3
## correcting...
## 2
## correcting...
## 1
## correcting...
## 0
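
To see why the error example at the top of this section stops, it can help to record the value of error after each pass. The following is a minimal sketch; the history list is an addition for illustration. The condition error > 1 is checked before every pass, so the loop ends as soon as the value is no longer greater than 1.

```python
error = 50.0
history = []          # value of error after each pass

while error > 1:
    error = error / 4
    history.append(error)

print(history)        # [12.5, 3.125, 0.78125]
```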

5.3 Add conditionals

The while loop that corrects the offset is a good start, but what if offset is negative? You can try to run the following code where offset is initialized to -6:

offset = -6
while offset != 0 :
    print("correcting...")
    offset = offset - 1
    print(offset)

The while loop will never stop running, because offset will be further decreased on every run. offset != 0 will never become False and the while loop continues forever. Fix things by putting an if-else statement inside the while loop. If your code is still taking too long to run, you probably made a mistake!

offset = -6
while offset != 0 :
    print("correcting...")
    if offset > 0 :
        offset = offset - 1
    else :
        offset = offset + 1
    print(offset)
## correcting...
## -5
## correcting...
## -4
## correcting...
## -3
## correcting...
## -2
## correcting...
## -1
## correcting...
## 0
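
The condition offset != 0 only becomes False if offset lands exactly on 0, which steps of size 1 guarantee only for integers. The sketch below is an assumption-laden illustration, not part of the course: the helper name correct and the max_steps cap are hypothetical. It adds an upper bound on the number of passes so that a mistake cannot loop forever.

```python
def correct(offset, max_steps=100):
    """Step offset toward 0 by 1 per pass; give up after max_steps passes."""
    steps = 0
    while offset != 0 and steps < max_steps:
        if offset > 0:
            offset = offset - 1
        else:
            offset = offset + 1
        steps += 1
    return offset, steps

print(correct(-6))    # (0, 6)
print(correct(8))     # (0, 8)
print(correct(2.5))   # never reaches exactly 0, so the cap ends the loop
```

With an offset of 2.5 the value ends up bouncing between 0.5 and -0.5, which is exactly the kind of endless loop the cap protects against.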

The while loop is not used that often in data science, so let’s head over to the for loop.

5.4 Lecture: for loop

5.5 Loop over a list

Have another look at the for loop that Hugo showed in the video:

fam = [1.73, 1.68, 1.71, 1.89]
for height in fam : 
    print(height)

As usual, you simply have to indent the code with 4 spaces to tell Python which code should be executed in the for loop.

areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for i in areas:
  print(i)
## 11.25
## 18.0
## 20.0
## 10.75
## 9.5

5.6 Indexes and values

A plain for loop over a list gives you each element in turn, but not its position. If you also need the index of the element you’re currently iterating over, you can use enumerate().

As an example, have a look at how the for loop from the video was converted:

fam = [1.73, 1.68, 1.71, 1.89]
for index, height in enumerate(fam) :
    print("person " + str(index) + ": " + str(height))

# Change for loop to use enumerate() and update print()
for i, a in enumerate(areas):
    print('room ' + str(i) + ': ' + str(a))
## room 0: 11.25
## room 1: 18.0
## room 2: 20.0
## room 3: 10.75
## room 4: 9.5

For non-programmer folks, room 0: 11.25 is strange. Wouldn’t it be better if the count started at 1?

for index, area in enumerate(areas) :
    print("room " + str(index + 1) + ": " + str(area))
## room 1: 11.25
## room 2: 18.0
## room 3: 20.0
## room 4: 10.75
## room 5: 9.5
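
As an alternative to adding 1 inside the loop, enumerate() accepts a start argument. A minimal sketch using the same areas list (the lines list is only there to collect the output):

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# start=1 makes enumerate() count from 1 instead of 0
lines = []
for index, area in enumerate(areas, start=1):
    lines.append("room " + str(index) + ": " + str(area))

print("\n".join(lines))
```

This keeps the arithmetic out of the print statement while producing the same five lines.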

5.7 Loop over list of lists

Remember the house variable from the Introduction to Python course? Have a look at its definition below. It’s basically a list of lists, where each sublist contains the name and area of a room in your house.

house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
for a in house:
  print('the ' + str(a[0]) + ' is ' + str(a[1]) + ' sqm')
## the hallway is 11.25 sqm
## the kitchen is 18.0 sqm
## the living room is 20.0 sqm
## the bedroom is 10.75 sqm
## the bathroom is 9.5 sqm
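
Since every sublist holds exactly two items, the loop can also unpack them into two named variables instead of indexing with [0] and [1]. The sketch below shows this variant; the names room, area and messages are chosen for illustration.

```python
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]

# Each sublist has two items, so it can be unpacked into two loop variables
messages = []
for room, area in house:
    messages.append("the " + room + " is " + str(area) + " sqm")

print("\n".join(messages))
```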

5.8 Lecture: Loop Data Structures, Part 1

5.9 Loop over dictionary

In Python 3, you need the items() method to loop over the key-value pairs of a dictionary:

world = { "afghanistan":30.55, 
          "albania":2.77,
          "algeria":39.21 }

for key, value in world.items() :
    print(key + " -- " + str(value))

Remember the europe dictionary that contained the names of some European countries as key and their capitals as corresponding value? Let’s write a loop to iterate over it!

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for key, value in europe.items():
  print('the capital of ' + str(key) + ' is ' + str(value))
## the capital of spain is madrid
## the capital of france is paris
## the capital of germany is berlin
## the capital of norway is oslo
## the capital of italy is rome
## the capital of poland is warsaw
## the capital of austria is vienna
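
It may help to see what a dictionary yields without items(). In the sketch below, which uses a shortened copy of europe, iterating over the dictionary itself produces only the keys, items() produces (key, value) tuples, and values() produces only the values.

```python
europe = {"spain": "madrid", "france": "paris", "germany": "berlin"}

keys_only = [key for key in europe]            # iterating the dict itself yields keys
pairs = [(k, v) for k, v in europe.items()]    # items() yields (key, value) tuples
values_only = list(europe.values())            # values() yields just the values

print(keys_only)
print(pairs)
print(values_only)
```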

5.10 Loop over NumPy array

If you’re dealing with a 1D NumPy array, looping over all elements can be as simple as:

for x in my_array :
    ...

If you’re dealing with a 2D NumPy array, it’s more complicated. A 2D array is built up of multiple 1D arrays. To explicitly iterate over all separate elements of a multi-dimensional array, you’ll need this syntax:

for x in np.nditer(my_array) :
    ...
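
The difference between the two loops is easier to see on a tiny array. The sketch below uses a made-up 2x2 array as a stand-in for my_array: a plain for loop yields one 1D row per iteration, while np.nditer() visits every element individually.

```python
import numpy as np

# Hypothetical 2x2 array: two players, columns are height and weight
my_array = np.array([[74, 180],
                     [72, 210]])

rows = [row.tolist() for row in my_array]          # plain loop: one 1D row per step
elements = [int(x) for x in np.nditer(my_array)]   # nditer: every single element

print(rows)       # [[74, 180], [72, 210]]
print(elements)   # [74, 180, 72, 210]
```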

Two NumPy arrays that you might recognize from the intro course are available: np_height, a NumPy array containing the heights of Major League Baseball players, and np_baseball, a 2D NumPy array that contains both the heights (first column) and weights (second column) of those players.

# Imports needed by the setup code below
import numpy as np
import pandas as pd

# create height_in and weight_lb
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Python/baseball.csv'
df = pd.read_csv(url)
height_in = list(df.Height)
weight_lb = list(df.Weight)
baseball = [list(i) for i in list(zip(height_in, weight_lb))]
np_height = np.array(height_in)
np_baseball = np.array(baseball)

# For loop over np_height
for x in np_height:
  print(str(x) + ' inches', end = '; ')
## 74 inches; 74 inches; 72 inches; 72 inches; 73 inches; 69 inches; 69 inches; 71 inches; 76 inches; 71 inches; 73 inches; 73 inches; 74 inches; 74 inches; 69 inches; 70 inches; 73 inches; 75 inches; 78 inches; 79 inches; 76 inches; 74 inches; 76 inches; 72 inches; 71 inches; 75 inches; 77 inches; 74 inches; 73 inches; 74 inches; 78 inches; 73 inches; 75 inches; 73 inches; 75 inches; 75 inches; 74 inches; 69 inches; 71 inches; 74 inches; 73 inches; 73 inches; 76 inches; 74 inches; 74 inches; 70 inches; 72 inches; 77 inches; 74 inches; 70 inches; 73 inches; 75 inches; 76 inches; 76 inches; 78 inches; 74 inches; 74 inches; 76 inches; 77 inches; 81 inches; 78 inches; 75 inches; 77 inches; 75 inches; 76 inches; 74 inches; 72 inches; 72 inches; 75 inches; 73 inches; 73 inches; 73 inches; 70 inches; 70 inches; 70 inches; 76 inches; 68 inches; 71 inches; 72 inches; 75 inches; 75 inches; 75 inches; 75 inches; 68 inches; 74 inches; 78 inches; 71 inches; 73 inches; 76 inches; 74 inches; 74 inches; 79 inches; 75 inches; 73 inches; 76 inches; 74 inches; 74 inches; 73 inches; 72 inches; 74 inches; 73 inches; 74 inches; 72 inches; 73 inches; 69 inches; 72 inches; 73 inches; 75 inches; 75 inches; 73 inches; 72 inches; 72 inches; 76 inches; 74 inches; 72 inches; 77 inches; 74 inches; 77 inches; 75 inches; 76 inches; 80 inches; 74 inches; 74 inches; 75 inches; 78 inches; 73 inches; 73 inches; 74 inches; 75 inches; 76 inches; 71 inches; 73 inches; 74 inches; 76 inches; 76 inches; 74 inches; 73 inches; 74 inches; 70 inches; 72 inches; 73 inches; 73 inches; 73 inches; 73 inches; 71 inches; 74 inches; 74 inches; 72 inches; 74 inches; 71 inches; 74 inches; 73 inches; 75 inches; 75 inches; 79 inches; 73 inches; 75 inches; 76 inches; 74 inches; 76 inches; 78 inches; 74 inches; 76 inches; 72 inches; 74 inches; 76 inches; 74 inches; 75 inches; 78 inches; 75 inches; 72 inches; 74 inches; 72 inches; 74 inches; 70 inches; 71 inches; 70 inches; 75 inches; 71 inches; 71 inches; 73 inches; 72 
inches; 71 inches; 73 inches; 72 inches; 75 inches; 74 inches; 74 inches; 75 inches; 73 inches; 77 inches; 73 inches; 76 inches; 75 inches; 74 inches; 76 inches; 75 inches; 73 inches; 71 inches; 76 inches; 75 inches; 72 inches; 71 inches; 77 inches; 73 inches; 74 inches; 71 inches; 72 inches; 74 inches; 75 inches; 73 inches; 72 inches; 75 inches; 75 inches; 74 inches; 72 inches; 74 inches; 71 inches; 70 inches; 74 inches; 77 inches; 77 inches; 75 inches; 75 inches; 78 inches; 75 inches; 76 inches; 73 inches; 75 inches; 75 inches; 79 inches; 77 inches; 76 inches; 71 inches; 75 inches; 74 inches; 69 inches; 71 inches; 76 inches; 72 inches; 72 inches; 70 inches; 72 inches; 73 inches; 71 inches; 72 inches; 71 inches; 73 inches; 72 inches; 73 inches; 74 inches; 74 inches; 72 inches; 75 inches; 74 inches; 74 inches; 77 inches; 75 inches; 73 inches; 72 inches; 71 inches; 74 inches; 77 inches; 75 inches; 75 inches; 75 inches; 78 inches; 78 inches; 74 inches; 76 inches; 78 inches; 76 inches; 70 inches; 72 inches; 80 inches; 74 inches; 74 inches; 71 inches; 70 inches; 72 inches; 71 inches; 74 inches; 71 inches; 72 inches; 71 inches; 74 inches; 69 inches; 76 inches; 75 inches; 75 inches; 76 inches; 73 inches; 76 inches; 73 inches; 77 inches; 73 inches; 72 inches; 72 inches; 77 inches; 77 inches; 71 inches; 74 inches; 74 inches; 73 inches; 78 inches; 75 inches; 73 inches; 70 inches; 74 inches; 72 inches; 73 inches; 73 inches; 75 inches; 75 inches; 74 inches; 76 inches; 73 inches; 74 inches; 75 inches; 75 inches; 72 inches; 73 inches; 73 inches; 72 inches; 74 inches; 78 inches; 76 inches; 73 inches; 74 inches; 75 inches; 70 inches; 75 inches; 71 inches; 72 inches; 78 inches; 75 inches; 73 inches; 73 inches; 71 inches; 75 inches; 77 inches; 72 inches; 69 inches; 73 inches; 74 inches; 72 inches; 70 inches; 75 inches; 70 inches; 72 inches; 72 inches; 74 inches; 73 inches; 74 inches; 76 inches; 75 inches; 80 inches; 72 inches; 75 inches; 73 inches; 74 inches; 74 inches; 73 inches; 
75 inches; 75 inches; 71 inches; 73 inches; 75 inches; 74 inches; 74 inches; 72 inches; 74 inches; 74 inches; 74 inches; 73 inches; 76 inches; 75 inches; 72 inches; 73 inches; 73 inches; 73 inches; 72 inches; 72 inches; 72 inches; 72 inches; 71 inches; 75 inches; 75 inches; 74 inches; 73 inches; 75 inches; 79 inches; 74 inches; 76 inches; 73 inches; 74 inches; 74 inches; 72 inches; 74 inches; 74 inches; 75 inches; 78 inches; 74 inches; 74 inches; 74 inches; 77 inches; 70 inches; 73 inches; 74 inches; 73 inches; 71 inches; 75 inches; 71 inches; 72 inches; 77 inches; 74 inches; 70 inches; 77 inches; 73 inches; 72 inches; 76 inches; 71 inches; 76 inches; 78 inches; 75 inches; 73 inches; 78 inches; 74 inches; 79 inches; 75 inches; 76 inches; 72 inches; 75 inches; 75 inches; 70 inches; 72 inches; 70 inches; 74 inches; 71 inches; 76 inches; 73 inches; 76 inches; 71 inches; 69 inches; 72 inches; 72 inches; 69 inches; 73 inches; 69 inches; 73 inches; 74 inches; 74 inches; 72 inches; 71 inches; 72 inches; 72 inches; 76 inches; 76 inches; 76 inches; 74 inches; 76 inches; 75 inches; 71 inches; 72 inches; 71 inches; 73 inches; 75 inches; 76 inches; 75 inches; 71 inches; 75 inches; 74 inches; 72 inches; 73 inches; 73 inches; 73 inches; 73 inches; 76 inches; 72 inches; 76 inches; 73 inches; 73 inches; 73 inches; 75 inches; 75 inches; 77 inches; 73 inches; 72 inches; 75 inches; 70 inches; 74 inches; 72 inches; 80 inches; 71 inches; 71 inches; 74 inches; 74 inches; 73 inches; 75 inches; 76 inches; 73 inches; 77 inches; 72 inches; 73 inches; 77 inches; 76 inches; 71 inches; 75 inches; 73 inches; 74 inches; 77 inches; 71 inches; 72 inches; 73 inches; 69 inches; 73 inches; 70 inches; 74 inches; 76 inches; 73 inches; 73 inches; 75 inches; 73 inches; 79 inches; 74 inches; 73 inches; 74 inches; 77 inches; 75 inches; 74 inches; 73 inches; 77 inches; 73 inches; 77 inches; 74 inches; 74 inches; 73 inches; 77 inches; 74 inches; 77 inches; 75 inches; 77 inches; 75 inches; 71 inches; 74 
inches; 70 inches; 79 inches; 72 inches; 72 inches; 70 inches; 74 inches; 74 inches; 72 inches; 73 inches; 72 inches; 74 inches; 74 inches; 76 inches; 82 inches; 74 inches; 74 inches; 70 inches; 73 inches; 73 inches; 74 inches; 77 inches; 72 inches; 76 inches; 73 inches; 73 inches; 72 inches; 74 inches; 74 inches; 71 inches; 72 inches; 75 inches; 74 inches; 74 inches; 77 inches; 70 inches; 71 inches; 73 inches; 76 inches; 71 inches; 75 inches; 74 inches; 72 inches; 76 inches; 79 inches; 76 inches; 73 inches; 76 inches; 78 inches; 75 inches; 76 inches; 72 inches; 72 inches; 73 inches; 73 inches; 75 inches; 71 inches; 76 inches; 70 inches; 75 inches; 74 inches; 75 inches; 73 inches; 71 inches; 71 inches; 72 inches; 73 inches; 73 inches; 72 inches; 69 inches; 73 inches; 78 inches; 71 inches; 73 inches; 75 inches; 76 inches; 70 inches; 74 inches; 77 inches; 75 inches; 79 inches; 72 inches; 77 inches; 73 inches; 75 inches; 75 inches; 75 inches; 73 inches; 73 inches; 76 inches; 77 inches; 75 inches; 70 inches; 71 inches; 71 inches; 75 inches; 74 inches; 69 inches; 70 inches; 75 inches; 72 inches; 75 inches; 73 inches; 72 inches; 72 inches; 72 inches; 76 inches; 75 inches; 74 inches; 69 inches; 73 inches; 72 inches; 72 inches; 75 inches; 77 inches; 76 inches; 80 inches; 77 inches; 76 inches; 79 inches; 71 inches; 75 inches; 73 inches; 76 inches; 77 inches; 73 inches; 76 inches; 70 inches; 75 inches; 73 inches; 75 inches; 70 inches; 69 inches; 71 inches; 72 inches; 72 inches; 73 inches; 70 inches; 70 inches; 73 inches; 76 inches; 75 inches; 72 inches; 73 inches; 79 inches; 71 inches; 72 inches; 74 inches; 74 inches; 74 inches; 72 inches; 76 inches; 76 inches; 72 inches; 72 inches; 71 inches; 72 inches; 72 inches; 70 inches; 77 inches; 74 inches; 72 inches; 76 inches; 71 inches; 76 inches; 71 inches; 73 inches; 70 inches; 73 inches; 73 inches; 72 inches; 71 inches; 71 inches; 71 inches; 72 inches; 72 inches; 74 inches; 74 inches; 74 inches; 71 inches; 72 inches; 75 inches; 
72 inches; 71 inches; 72 inches; 72 inches; 72 inches; 72 inches; 74 inches; 74 inches; 77 inches; 75 inches; 73 inches; 75 inches; 73 inches; 76 inches; 72 inches; 77 inches; 75 inches; 72 inches; 71 inches; 71 inches; 75 inches; 72 inches; 73 inches; 73 inches; 71 inches; 70 inches; 75 inches; 71 inches; 76 inches; 73 inches; 68 inches; 71 inches; 72 inches; 74 inches; 77 inches; 72 inches; 76 inches; 78 inches; 81 inches; 72 inches; 73 inches; 76 inches; 72 inches; 72 inches; 74 inches; 76 inches; 73 inches; 76 inches; 75 inches; 70 inches; 71 inches; 74 inches; 72 inches; 73 inches; 76 inches; 76 inches; 73 inches; 71 inches; 68 inches; 71 inches; 71 inches; 74 inches; 77 inches; 69 inches; 72 inches; 76 inches; 75 inches; 76 inches; 75 inches; 76 inches; 72 inches; 74 inches; 76 inches; 74 inches; 72 inches; 75 inches; 78 inches; 77 inches; 70 inches; 72 inches; 79 inches; 74 inches; 71 inches; 68 inches; 77 inches; 75 inches; 71 inches; 72 inches; 70 inches; 72 inches; 72 inches; 73 inches; 72 inches; 74 inches; 72 inches; 72 inches; 75 inches; 72 inches; 73 inches; 74 inches; 72 inches; 78 inches; 75 inches; 72 inches; 74 inches; 75 inches; 75 inches; 76 inches; 74 inches; 74 inches; 73 inches; 74 inches; 71 inches; 74 inches; 75 inches; 76 inches; 74 inches; 76 inches; 76 inches; 73 inches; 75 inches; 75 inches; 74 inches; 68 inches; 72 inches; 75 inches; 71 inches; 70 inches; 72 inches; 73 inches; 72 inches; 75 inches; 74 inches; 70 inches; 76 inches; 71 inches; 82 inches; 72 inches; 73 inches; 74 inches; 71 inches; 75 inches; 77 inches; 72 inches; 74 inches; 72 inches; 73 inches; 78 inches; 77 inches; 73 inches; 73 inches; 73 inches; 73 inches; 73 inches; 76 inches; 75 inches; 70 inches; 73 inches; 72 inches; 73 inches; 75 inches; 74 inches; 73 inches; 73 inches; 76 inches; 73 inches; 75 inches; 70 inches; 77 inches; 72 inches; 77 inches; 74 inches; 75 inches; 75 inches; 75 inches; 75 inches; 72 inches; 74 inches; 71 inches; 76 inches; 71 inches; 75 
inches; 76 inches; 83 inches; 75 inches; 74 inches; 76 inches; 72 inches; 72 inches; 75 inches; 75 inches; 72 inches; 77 inches; 73 inches; 72 inches; 70 inches; 74 inches; 72 inches; 74 inches; 72 inches; 71 inches; 70 inches; 71 inches; 76 inches; 74 inches; 76 inches; 74 inches; 74 inches; 74 inches; 75 inches; 75 inches; 71 inches; 71 inches; 74 inches; 77 inches; 71 inches; 74 inches; 75 inches; 77 inches; 76 inches; 74 inches; 76 inches; 72 inches; 71 inches; 72 inches; 75 inches; 73 inches; 68 inches; 72 inches; 69 inches; 73 inches; 73 inches; 75 inches; 70 inches; 70 inches; 74 inches; 75 inches; 74 inches; 74 inches; 73 inches; 74 inches; 75 inches; 77 inches; 73 inches; 74 inches; 76 inches; 74 inches; 75 inches; 73 inches; 76 inches; 78 inches; 75 inches; 73 inches; 77 inches; 74 inches; 72 inches; 74 inches; 72 inches; 71 inches; 73 inches; 75 inches; 73 inches; 67 inches; 67 inches; 76 inches; 74 inches; 73 inches; 70 inches; 75 inches; 70 inches; 72 inches; 77 inches; 79 inches; 78 inches; 74 inches; 75 inches; 75 inches; 78 inches; 76 inches; 75 inches; 69 inches; 75 inches; 72 inches; 75 inches; 73 inches; 74 inches; 75 inches; 75 inches; 73 inches;
# For loop over np_baseball
for x in np.nditer(np_baseball):
  print(x, end = '; ')
## 74; 180; 74; 215; 72; 210; 72; 210; 73; 188; 69; 176; 69; 209; 71; 200; 76; 231; 71; 180; 73; 188; 73; 180; 74; 185; 74; 160; 69; 180; 70; 185; 73; 189; 75; 185; 78; 219; 79; 230; 76; 205; 74; 230; 76; 195; 72; 180; 71; 192; 75; 225; 77; 203; 74; 195; 73; 182; 74; 188; 78; 200; 73; 180; 75; 200; 73; 200; 75; 245; 75; 240; 74; 215; 69; 185; 71; 175; 74; 199; 73; 200; 73; 215; 76; 200; 74; 205; 74; 206; 70; 186; 72; 188; 77; 220; 74; 210; 70; 195; 73; 200; 75; 200; 76; 212; 76; 224; 78; 210; 74; 205; 74; 220; 76; 195; 77; 200; 81; 260; 78; 228; 75; 270; 77; 200; 75; 210; 76; 190; 74; 220; 72; 180; 72; 205; 75; 210; 73; 220; 73; 211; 73; 200; 70; 180; 70; 190; 70; 170; 76; 230; 68; 155; 71; 185; 72; 185; 75; 200; 75; 225; 75; 225; 75; 220; 68; 160; 74; 205; 78; 235; 71; 250; 73; 210; 76; 190; 74; 160; 74; 200; 79; 205; 75; 222; 73; 195; 76; 205; 74; 220; 74; 220; 73; 170; 72; 185; 74; 195; 73; 220; 74; 230; 72; 180; 73; 220; 69; 180; 72; 180; 73; 170; 75; 210; 75; 215; 73; 200; 72; 213; 72; 180; 76; 192; 74; 235; 72; 185; 77; 235; 74; 210; 77; 222; 75; 210; 76; 230; 80; 220; 74; 180; 74; 190; 75; 200; 78; 210; 73; 194; 73; 180; 74; 190; 75; 240; 76; 200; 71; 198; 73; 200; 74; 195; 76; 210; 76; 220; 74; 190; 73; 210; 74; 225; 70; 180; 72; 185; 73; 170; 73; 185; 73; 185; 73; 180; 71; 178; 74; 175; 74; 200; 72; 204; 74; 211; 71; 190; 74; 210; 73; 190; 75; 190; 75; 185; 79; 290; 73; 175; 75; 185; 76; 200; 74; 220; 76; 170; 78; 220; 74; 190; 76; 220; 72; 205; 74; 200; 76; 250; 74; 225; 75; 215; 78; 210; 75; 215; 72; 195; 74; 200; 72; 194; 74; 220; 70; 180; 71; 180; 70; 170; 75; 195; 71; 180; 71; 170; 73; 206; 72; 205; 71; 200; 73; 225; 72; 201; 75; 225; 74; 233; 74; 180; 75; 225; 73; 180; 77; 220; 73; 180; 76; 237; 75; 215; 74; 190; 76; 235; 75; 190; 73; 180; 71; 165; 76; 195; 75; 200; 72; 190; 71; 190; 77; 185; 73; 185; 74; 205; 71; 190; 72; 205; 74; 206; 75; 220; 73; 208; 72; 170; 75; 195; 75; 210; 74; 190; 72; 211; 74; 230; 71; 170; 70; 185; 74; 185; 77; 241; 77; 
225; 75; 210; 75; 175; 78; 230; 75; 200; 76; 215; 73; 198; 75; 226; 75; 278; 79; 215; 77; 230; 76; 240; 71; 184; 75; 219; 74; 170; 69; 218; 71; 190; 76; 225; 72; 220; 72; 176; 70; 190; 72; 197; 73; 204; 71; 167; 72; 180; 71; 195; 73; 220; 72; 215; 73; 185; 74; 190; 74; 205; 72; 205; 75; 200; 74; 210; 74; 215; 77; 200; 75; 205; 73; 211; 72; 190; 71; 208; 74; 200; 77; 210; 75; 232; 75; 230; 75; 210; 78; 220; 78; 210; 74; 202; 76; 212; 78; 225; 76; 170; 70; 190; 72; 200; 80; 237; 74; 220; 74; 170; 71; 193; 70; 190; 72; 150; 71; 220; 74; 200; 71; 190; 72; 185; 71; 185; 74; 200; 69; 172; 76; 220; 75; 225; 75; 190; 76; 195; 73; 219; 76; 190; 73; 197; 77; 200; 73; 195; 72; 210; 72; 177; 77; 220; 77; 235; 71; 180; 74; 195; 74; 195; 73; 190; 78; 230; 75; 190; 73; 200; 70; 190; 74; 190; 72; 200; 73; 200; 73; 184; 75; 200; 75; 180; 74; 219; 76; 187; 73; 200; 74; 220; 75; 205; 75; 190; 72; 170; 73; 160; 73; 215; 72; 175; 74; 205; 78; 200; 76; 214; 73; 200; 74; 190; 75; 180; 70; 205; 75; 220; 71; 190; 72; 215; 78; 235; 75; 191; 73; 200; 73; 181; 71; 200; 75; 210; 77; 240; 72; 185; 69; 165; 73; 190; 74; 185; 72; 175; 70; 155; 75; 210; 70; 170; 72; 175; 72; 220; 74; 210; 73; 205; 74; 200; 76; 205; 75; 195; 80; 240; 72; 150; 75; 200; 73; 215; 74; 202; 74; 200; 73; 190; 75; 205; 75; 190; 71; 160; 73; 215; 75; 185; 74; 200; 74; 190; 72; 210; 74; 185; 74; 220; 74; 190; 73; 202; 76; 205; 75; 220; 72; 175; 73; 160; 73; 190; 73; 200; 72; 229; 72; 206; 72; 220; 72; 180; 71; 195; 75; 175; 75; 188; 74; 230; 73; 190; 75; 200; 79; 190; 74; 219; 76; 235; 73; 180; 74; 180; 74; 180; 72; 200; 74; 234; 74; 185; 75; 220; 78; 223; 74; 200; 74; 210; 74; 200; 77; 210; 70; 190; 73; 177; 74; 227; 73; 180; 71; 195; 75; 199; 71; 175; 72; 185; 77; 240; 74; 210; 70; 180; 77; 194; 73; 225; 72; 180; 76; 205; 71; 193; 76; 230; 78; 230; 75; 220; 73; 200; 78; 249; 74; 190; 79; 208; 75; 245; 76; 250; 72; 160; 75; 192; 75; 220; 70; 170; 72; 197; 70; 155; 74; 190; 71; 200; 76; 220; 73; 210; 76; 228; 71; 190; 69; 
160; 72; 184; 72; 180; 69; 180; 73; 200; 69; 176; 73; 160; 74; 222; 74; 211; 72; 195; 71; 200; 72; 175; 72; 206; 76; 240; 76; 185; 76; 260; 74; 185; 76; 221; 75; 205; 71; 200; 72; 170; 71; 201; 73; 205; 75; 185; 76; 205; 75; 245; 71; 220; 75; 210; 74; 220; 72; 185; 73; 175; 73; 170; 73; 180; 73; 200; 76; 210; 72; 175; 76; 220; 73; 206; 73; 180; 73; 210; 75; 195; 75; 200; 77; 200; 73; 164; 72; 180; 75; 220; 70; 195; 74; 205; 72; 170; 80; 240; 71; 210; 71; 195; 74; 200; 74; 205; 73; 192; 75; 190; 76; 170; 73; 240; 77; 200; 72; 205; 73; 175; 77; 250; 76; 220; 71; 224; 75; 210; 73; 195; 74; 180; 77; 245; 71; 175; 72; 180; 73; 215; 69; 175; 73; 180; 70; 195; 74; 230; 76; 230; 73; 205; 73; 215; 75; 195; 73; 180; 79; 205; 74; 180; 73; 190; 74; 180; 77; 190; 75; 190; 74; 220; 73; 210; 77; 255; 73; 190; 77; 230; 74; 200; 74; 205; 73; 210; 77; 225; 74; 215; 77; 220; 75; 205; 77; 200; 75; 220; 71; 197; 74; 225; 70; 187; 79; 245; 72; 185; 72; 185; 70; 175; 74; 200; 74; 180; 72; 188; 73; 225; 72; 200; 74; 210; 74; 245; 76; 213; 82; 231; 74; 165; 74; 228; 70; 210; 73; 250; 73; 191; 74; 190; 77; 200; 72; 215; 76; 254; 73; 232; 73; 180; 72; 215; 74; 220; 74; 180; 71; 200; 72; 170; 75; 195; 74; 210; 74; 200; 77; 220; 70; 165; 71; 180; 73; 200; 76; 200; 71; 170; 75; 224; 74; 220; 72; 180; 76; 198; 79; 240; 76; 239; 73; 185; 76; 210; 78; 220; 75; 200; 76; 195; 72; 220; 72; 230; 73; 170; 73; 220; 75; 230; 71; 165; 76; 205; 70; 192; 75; 210; 74; 205; 75; 200; 73; 210; 71; 185; 71; 195; 72; 202; 73; 205; 73; 195; 72; 180; 69; 200; 73; 185; 78; 240; 71; 185; 73; 220; 75; 205; 76; 205; 70; 180; 74; 201; 77; 190; 75; 208; 79; 240; 72; 180; 77; 230; 73; 195; 75; 215; 75; 190; 75; 195; 73; 215; 73; 215; 76; 220; 77; 220; 75; 230; 70; 195; 71; 190; 71; 195; 75; 209; 74; 204; 69; 170; 70; 185; 75; 205; 72; 175; 75; 210; 73; 190; 72; 180; 72; 180; 72; 160; 76; 235; 75; 200; 74; 210; 69; 180; 73; 190; 72; 197; 72; 203; 75; 205; 77; 170; 76; 200; 80; 250; 77; 200; 76; 220; 79; 200; 71; 190; 75; 
170; 73; 190; 76; 220; 77; 215; 73; 206; 76; 215; 70; 185; 75; 235; 73; 188; 75; 230; 70; 195; 69; 168; 71; 190; 72; 160; 72; 200; 73; 200; 70; 189; 70; 180; 73; 190; 76; 200; 75; 220; 72; 187; 73; 240; 79; 190; 71; 180; 72; 185; 74; 210; 74; 220; 74; 219; 72; 190; 76; 193; 76; 175; 72; 180; 72; 215; 71; 210; 72; 200; 72; 190; 70; 185; 77; 220; 74; 170; 72; 195; 76; 205; 71; 195; 76; 210; 71; 190; 73; 190; 70; 180; 73; 220; 73; 190; 72; 186; 71; 185; 71; 190; 71; 180; 72; 190; 72; 170; 74; 210; 74; 240; 74; 220; 71; 180; 72; 210; 75; 210; 72; 195; 71; 160; 72; 180; 72; 205; 72; 200; 72; 185; 74; 245; 74; 190; 77; 210; 75; 200; 73; 200; 75; 222; 73; 215; 76; 240; 72; 170; 77; 220; 75; 156; 72; 190; 71; 202; 71; 221; 75; 200; 72; 190; 73; 210; 73; 190; 71; 200; 70; 165; 75; 190; 71; 185; 76; 230; 73; 208; 68; 209; 71; 175; 72; 180; 74; 200; 77; 205; 72; 200; 76; 250; 78; 210; 81; 230; 72; 244; 73; 202; 76; 240; 72; 200; 72; 215; 74; 177; 76; 210; 73; 170; 76; 215; 75; 217; 70; 198; 71; 200; 74; 220; 72; 170; 73; 200; 76; 230; 76; 231; 73; 183; 71; 192; 68; 167; 71; 190; 71; 180; 74; 180; 77; 215; 69; 160; 72; 205; 76; 223; 75; 175; 76; 170; 75; 190; 76; 240; 72; 175; 74; 230; 76; 223; 74; 196; 72; 167; 75; 195; 78; 190; 77; 250; 70; 190; 72; 190; 79; 190; 74; 170; 71; 160; 68; 150; 77; 225; 75; 220; 71; 209; 72; 210; 70; 176; 72; 260; 72; 195; 73; 190; 72; 184; 74; 180; 72; 195; 72; 195; 75; 219; 72; 225; 73; 212; 74; 202; 72; 185; 78; 200; 75; 209; 72; 200; 74; 195; 75; 228; 75; 210; 76; 190; 74; 212; 74; 190; 73; 218; 74; 220; 71; 190; 74; 235; 75; 210; 76; 200; 74; 188; 76; 210; 76; 235; 73; 188; 75; 215; 75; 216; 74; 220; 68; 180; 72; 185; 75; 200; 71; 210; 70; 220; 72; 185; 73; 231; 72; 210; 75; 195; 74; 200; 70; 205; 76; 200; 71; 190; 82; 250; 72; 185; 73; 180; 74; 170; 71; 180; 75; 208; 77; 235; 72; 215; 74; 244; 72; 220; 73; 185; 78; 230; 77; 190; 73; 200; 73; 180; 73; 190; 73; 196; 73; 180; 76; 230; 75; 224; 70; 160; 73; 178; 72; 205; 73; 185; 75; 210; 74; 
180; 73; 190; 73; 200; 76; 257; 73; 190; 75; 220; 70; 165; 77; 205; 72; 200; 77; 208; 74; 185; 75; 215; 75; 170; 75; 235; 75; 210; 72; 170; 74; 180; 71; 170; 76; 190; 71; 150; 75; 230; 76; 203; 83; 260; 75; 246; 74; 186; 76; 210; 72; 198; 72; 210; 75; 215; 75; 180; 72; 200; 77; 245; 73; 200; 72; 192; 70; 192; 74; 200; 72; 192; 74; 205; 72; 190; 71; 186; 70; 170; 71; 197; 76; 219; 74; 200; 76; 220; 74; 207; 74; 225; 74; 207; 75; 212; 75; 225; 71; 170; 71; 190; 74; 210; 77; 230; 71; 210; 74; 200; 75; 238; 77; 234; 76; 222; 74; 200; 76; 190; 72; 170; 71; 220; 72; 223; 75; 210; 73; 215; 68; 196; 72; 175; 69; 175; 73; 189; 73; 205; 75; 210; 70; 180; 70; 180; 74; 197; 75; 220; 74; 228; 74; 190; 73; 204; 74; 165; 75; 216; 77; 220; 73; 208; 74; 210; 76; 215; 74; 195; 75; 200; 73; 215; 76; 229; 78; 240; 75; 207; 73; 205; 77; 208; 74; 185; 72; 190; 74; 170; 72; 208; 71; 225; 73; 190; 75; 225; 73; 185; 67; 180; 67; 165; 76; 240; 74; 220; 73; 212; 70; 163; 75; 215; 70; 175; 72; 205; 77; 210; 79; 205; 78; 208; 74; 215; 75; 180; 75; 200; 78; 230; 76; 211; 75; 230; 69; 190; 75; 220; 72; 180; 75; 205; 73; 190; 74; 180; 75; 205; 75; 190; 73; 195;

5.11 Lecture: Loop Data Structures, Part 2

5.12 Loop over DataFrame

Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:

for lab, row in brics.iterrows() :
    ...

In this exercise you will be working on the cars DataFrame. It contains information on the cars per capita and whether people drive right or left for seven countries in the world.

for lab, row in cars.iterrows():
  print(lab)
  print(row)
## US
## cars_per_cap              809
## country         United States
## drives_right             True
## Name: US, dtype: object
## AUS
## cars_per_cap          731
## country         Australia
## drives_right        False
## Name: AUS, dtype: object
## JPN
## cars_per_cap      588
## country         Japan
## drives_right    False
## Name: JPN, dtype: object
## IN
## cars_per_cap       18
## country         India
## drives_right    False
## Name: IN, dtype: object
## RU
## cars_per_cap       200
## country         Russia
## drives_right      True
## Name: RU, dtype: object
## MOR
## cars_per_cap         70
## country         Morocco
## drives_right       True
## Name: MOR, dtype: object
## EG
## cars_per_cap       45
## country         Egypt
## drives_right     True
## Name: EG, dtype: object

The row data that’s generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

for lab, row in brics.iterrows() :
    print(row['country'])

# Adapt for loop
for lab, row in cars.iterrows() :
    print(str(lab) + ': ' + str(row['cars_per_cap']))
## US: 809
## AUS: 731
## JPN: 588
## IN: 18
## RU: 200
## MOR: 70
## EG: 45

5.13 Add column

In the video, Hugo showed you how to add the length of the country names of the brics DataFrame in a new column:

for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])

You can do similar things on the cars DataFrame.

# loop that adds COUNTRY column
for lab, row in cars.iterrows():
  cars.loc[lab, 'COUNTRY'] = row['country'].upper()

# Print cars
print(cars)
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JPN           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT

Using iterrows() to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you’re creating a new Pandas Series.

If you want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, you’ll want to use apply().

Compare the iterrows() version with the apply() version to get the same result in the brics DataFrame:

for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])

brics["name_length"] = brics["country"].apply(len)

We can do a similar thing to call the upper() method on every name in the country column. However, upper() is a method, so we’ll need a slightly different approach:

# Use .apply(str.upper)
cars['COUNTRY'] = cars["country"].apply(str.upper)
print(cars)
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JPN           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT
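As a side note that is not part of the original exercise, pandas also offers vectorized string methods through the .str accessor, which give the same result without apply(). The sketch below uses a three-row stand-in for the cars DataFrame, built only for illustration.

```python
import pandas as pd

# Stand-in for the cars DataFrame: three rows copied from the output above
cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan"]},
    index=["US", "AUS", "JPN"],
)

# apply() calls str.upper once per element
via_apply = cars["country"].apply(str.upper)

# The .str accessor applies the same string method to the whole column
via_str = cars["country"].str.upper()

print(list(via_apply) == list(via_str))  # True
```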

6 Case Study: Hacker Statistics

This chapter will allow you to apply all the concepts you’ve learned in this course. You will use hacker statistics to calculate your chances of winning a bet. Use random number generators, loops, and Matplotlib to gain a competitive edge!

6.1 Lecture: Random Numbers

6.2 Random float

Randomness has many uses in science, art, statistics, cryptography, gaming, gambling, and other fields. You’re going to use randomness to simulate a game.

All the functionality you need is contained in the random package, a sub-package of numpy. You’ll be using two functions from this package:

- seed(): sets the random seed, so that your results are reproducible between simulations. As an argument, it takes an integer of your choosing. If you call the function, no output will be generated.
- rand(): if you don’t specify any arguments, it generates a random float between zero and one.

# Import numpy and set the seed
import numpy as np
np.random.seed(123)

# Generate and print random float
print(np.random.rand())
## 0.6964691855978616
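What "reproducible" means here can be checked directly. In this small sketch, resetting the seed to the same integer makes rand() return the same float again; the variable names are illustrative.

```python
import numpy as np

np.random.seed(123)
first = np.random.rand()

# Resetting the seed restarts the sequence of random numbers
np.random.seed(123)
second = np.random.rand()

print(first == second)  # True
```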

Great! Now let’s simulate a dice.

6.3 Roll the dice

As Hugo explained in the video, you can just as well use randint(), also a function of the random package, to generate random integers. The following call generates the integer 4, 5, 6 or 7 randomly; 8 is not included.

np.random.randint(4, 8)
# Use randint() to simulate a dice
print(np.random.randint(1, 7))
## 3
# Use randint() again
print(np.random.randint(1, 7))
## 5
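The excluded upper bound is easy to verify empirically. The sketch below assumes the size argument of randint() to draw many throws at once, and checks that every value lies between 1 and 6.

```python
import numpy as np

np.random.seed(123)

# Draw 1000 throws in one call; the upper bound 7 is excluded
throws = np.random.randint(1, 7, size=1000)

# Every throw is at least 1 and at most 6
print(throws.min() >= 1 and throws.max() <= 6)  # True
```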

Alright! Time to actually start coding things up!

6.4 Determine your next move

In the Empire State Building bet, your next move depends on the number you get after throwing the dice. An if-elif-else construct expresses this neatly!

# Starting step
step = 50

# Roll the dice
dice = np.random.randint(1, 7)

if dice <= 2 :
    step = step - 1
elif dice < 6 :
    step = step + 1
else :
    step = step + np.random.randint(1,7)

# Print out dice and step
print([dice, step])
## [3, 51]
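Because the outcome depends on random throws, the rules are easier to check when the throws are passed in explicitly. The function next_step below is a hypothetical helper, not part of the course code; it takes the current step, the dice value, and the second throw that is used only when the dice shows 6.

```python
def next_step(step, dice, bonus):
    """Hypothetical helper: apply the rules of the bet to one throw.

    bonus is the result of the second throw, used only when dice is 6.
    """
    if dice <= 2:
        return step - 1
    elif dice <= 5:
        return step + 1
    else:
        return step + bonus

print(next_step(50, 1, 4))  # 49: a 1 or 2 means one step down
print(next_step(50, 3, 4))  # 51: a 3, 4 or 5 means one step up
print(next_step(50, 6, 4))  # 54: a 6 means throw again and climb that many steps
```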

6.5 Lecture: Random Walk

6.6 The next step

You have already written Python code that determines the next step based on the previous step. Now it’s time to put this code inside a for loop so that we can simulate a random walk.

# Initialize random_walk
random_walk = [0]

for x in range(100) :
    # Set step: last element in random_walk
    step = random_walk[-1]

    # Roll the dice
    dice = np.random.randint(1,7)

    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    # Append step to random_walk
    random_walk.append(step)

# Print random_walk
print(random_walk)
## [0, -1, 0, 1, 2, 1, 0, -1, -2, -3, -4, -5, -6, -5, 0, -1, -2, -1, -2, -1, 0, 1, 2, 3, 2, 3, 2, 3, 4, 5, 6, 5, 9, 10, 9, 10, 9, 10, 11, 12, 13, 14, 15, 16, 19, 20, 21, 22, 27, 28, 32, 33, 32, 33, 34, 33, 34, 35, 37, 38, 39, 38, 37, 38, 39, 38, 37, 38, 39, 41, 40, 39, 40, 39, 40, 41, 42, 44, 43, 44, 45, 46, 47, 48, 47, 46, 47, 46, 47, 48, 47, 50, 51, 52, 53, 52, 53, 54, 58, 57, 56]

6.7 How low can you go?

Things are shaping up nicely! You already have code that calculates your location in the Empire State Building after 100 dice throws. However, there’s something we haven’t thought about - you can’t go below 0!

A typical way to solve problems like this is by using max(). If you pass max() two arguments, the larger one gets returned. For example, to make sure that a variable x never goes below 10 when you decrease it, you can use:

x = max(10, x - 1)
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        # use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

print(random_walk)
## [0, 2, 1, 2, 4, 5, 6, 11, 10, 11, 12, 13, 14, 15, 14, 19, 20, 21, 22, 21, 20, 19, 18, 17, 18, 19, 20, 26, 25, 24, 23, 24, 25, 26, 25, 26, 27, 26, 31, 32, 31, 30, 29, 28, 29, 28, 27, 29, 30, 33, 34, 36, 37, 38, 39, 38, 37, 38, 39, 40, 41, 40, 41, 42, 43, 46, 47, 48, 47, 48, 47, 48, 49, 50, 54, 53, 52, 53, 54, 55, 54, 55, 54, 55, 57, 62, 61, 62, 63, 64, 65, 66, 67, 66, 67, 68, 69, 71, 73, 72, 73]

You’re not going below zero anymore. Great!
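One way to confirm this without reading the whole list is to wrap the loop in a function and check the minimum. The name simulate_walk and its n_throws parameter are assumptions made for this sketch; the body is the same logic as the loop above.

```python
import numpy as np

def simulate_walk(n_throws=100):
    """Hypothetical helper: return one random walk as a list of positions."""
    walk = [0]
    for _ in range(n_throws):
        step = walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            # max() keeps the step from dropping below 0
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        walk.append(step)
    return walk

np.random.seed(123)
walk = simulate_walk()
print(len(walk))       # 101: the starting point plus 100 throws
print(min(walk) >= 0)  # True
```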

6.8 Visualize the walk

Let’s visualize this random walk! Remember how you could use matplotlib to build a line plot?

import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()

The first list you pass is mapped onto the x-axis and the second list is mapped onto the y-axis.

If you pass only one argument, matplotlib will use the index of the list (0, 1, 2, ...) for the x-axis and the values in the list for the y-axis.
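The one-argument behavior can be checked by inspecting the x data of the line that plt.plot() returns. This is a minimal sketch; the Agg backend is an assumption chosen so that it runs without a display, and the list y is made-up example data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

y = [0, 1, 0, 2, 3]  # made-up example values

# One argument: x defaults to the positions 0, 1, 2, ...
line_one_arg, = plt.plot(y)

# Two arguments: the same positions passed explicitly
line_two_args, = plt.plot(range(len(y)), y)

print(list(line_one_arg.get_xdata()) == list(line_two_args.get_xdata()))  # True
```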

# Initialization
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

# Plot random_walk
plt.plot(random_walk)

# Show the plot
plt.show()

This is pretty cool! You can clearly see how your random walk progressed.

6.9 Lecture: Distribution

6.10 Simulate multiple walks

A single random walk is one thing, but that doesn’t tell you if you have a good chance at winning the bet. To get an idea about how big your chances are of reaching 60 steps, you can repeatedly simulate the random walk and collect the results.

# Initialize all_walks
all_walks = []

# Simulate random walk 10 times
for i in range(10) :

    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)

        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)

    # Append random_walk to all_walks
    all_walks.append(random_walk)

# Print all_walks
print(all_walks)
## [[0, 1, 2, 3, 4, 5, 6, 7, 6, 7, 6, 5, 6, 5, 6, 5, 6, 7, 11, 12, 11, 17, 16, 15, 16, 15, 14, 15, 14, 18, 17, 18, 17, 18, 17, 18, 20, 19, 18, 17, 18, 17, 22, 23, 24, 23, 22, 23, 22, 23, 22, 27, 28, 27, 26, 25, 24, 25, 26, 30, 36, 37, 38, 39, 40, 39, 40, 42, 43, 44, 45, 44, 43, 44, 43, 44, 45, 46, 45, 46, 47, 48, 47, 46, 47, 48, 53, 54, 55, 60, 59, 60, 59, 60, 61, 62, 63, 62, 68, 67, 68], [0, 0, 0, 1, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 5, 6, 7, 8, 9, 10, 9, 10, 11, 10, 11, 12, 15, 14, 15, 14, 15, 18, 19, 20, 21, 20, 19, 22, 23, 24, 25, 24, 23, 24, 27, 28, 33, 34, 33, 34, 33, 34, 33, 39, 38, 37, 38, 40, 39, 38, 37, 38, 39, 40, 41, 45, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 62, 61, 60, 61, 62, 61, 67, 66, 67, 68, 67, 66, 67, 66, 65, 71, 70, 69, 70, 71, 70, 69, 68, 67, 68], [0, 6, 7, 10, 11, 17, 18, 19, 25, 24, 30, 29, 30, 31, 32, 31, 37, 38, 37, 38, 37, 38, 37, 38, 37, 38, 42, 43, 45, 44, 45, 44, 43, 44, 43, 44, 43, 47, 51, 50, 49, 48, 49, 50, 54, 55, 56, 60, 59, 58, 57, 58, 59, 61, 60, 59, 60, 61, 63, 66, 71, 72, 71, 72, 73, 74, 75, 76, 75, 76, 77, 83, 82, 87, 86, 90, 89, 93, 92, 95, 96, 95, 96, 102, 101, 100, 99, 103, 102, 101, 102, 103, 102, 103, 104, 105, 106, 105, 104, 103, 104], [0, 0, 0, 4, 5, 7, 11, 17, 16, 15, 16, 17, 18, 17, 18, 17, 18, 19, 18, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 33, 32, 35, 36, 35, 34, 35, 36, 37, 36, 35, 34, 33, 34, 35, 36, 37, 38, 39, 40, 39, 40, 41, 43, 42, 43, 44, 47, 49, 50, 49, 48, 47, 46, 45, 46, 45, 46, 48, 49, 50, 49, 50, 49, 48, 49, 48, 47, 46, 47, 46, 45, 46, 47, 48, 50, 51, 52, 51, 50, 51, 57, 56, 57, 58, 63, 62, 63, 62, 63, 64], [0, 0, 1, 2, 8, 9, 10, 11, 10, 12, 13, 14, 15, 14, 15, 16, 17, 18, 17, 18, 17, 18, 19, 18, 19, 23, 24, 27, 28, 32, 33, 32, 33, 34, 33, 32, 37, 38, 39, 38, 37, 38, 39, 40, 39, 43, 42, 43, 44, 45, 46, 47, 48, 49, 48, 47, 46, 47, 48, 52, 53, 52, 53, 54, 53, 59, 60, 61, 62, 61, 62, 63, 66, 65, 66, 65, 64, 63, 64, 65, 67, 68, 69, 73, 74, 73, 72, 73, 74, 73, 72, 73, 74, 75, 74, 73, 74, 75, 76, 75, 76], 
[0, 1, 0, 0, 0, 1, 2, 3, 4, 5, 10, 14, 13, 14, 13, 12, 11, 12, 11, 12, 13, 12, 16, 17, 16, 17, 16, 15, 16, 15, 19, 20, 21, 22, 23, 24, 23, 24, 25, 26, 27, 28, 27, 32, 33, 34, 33, 34, 33, 34, 35, 34, 35, 40, 41, 42, 41, 42, 43, 44, 43, 44, 43, 44, 45, 44, 43, 42, 43, 44, 43, 42, 41, 42, 46, 47, 48, 49, 50, 51, 50, 51, 52, 51, 52, 57, 58, 57, 56, 57, 56, 55, 54, 58, 59, 60, 61, 60, 61, 62, 63], [0, 1, 2, 1, 0, 3, 2, 1, 0, 0, 1, 7, 8, 7, 8, 9, 8, 7, 8, 9, 10, 9, 13, 14, 13, 15, 16, 15, 16, 17, 18, 19, 20, 21, 20, 19, 20, 21, 20, 21, 22, 21, 20, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 31, 32, 33, 34, 35, 36, 35, 34, 40, 41, 42, 41, 40, 39, 43, 44, 48, 47, 53, 54, 55, 59, 60, 59, 58, 59, 60, 61, 62, 61, 67, 68, 67, 71, 72, 71, 72, 71, 77, 83, 84, 83, 84, 85, 86, 87, 88], [0, 0, 3, 2, 4, 5, 11, 10, 11, 12, 11, 10, 11, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 23, 24, 25, 26, 25, 24, 23, 24, 23, 27, 26, 25, 26, 28, 29, 34, 33, 34, 35, 39, 38, 39, 40, 39, 38, 39, 40, 41, 40, 39, 38, 39, 38, 37, 38, 37, 36, 35, 36, 37, 36, 35, 34, 35, 36, 37, 36, 35, 36, 37, 38, 39, 38, 39, 38, 39, 40, 41, 42, 43, 48, 53, 52, 53, 54, 53, 54, 60, 59, 60, 59, 60, 59], [0, 1, 2, 3, 2, 1, 2, 3, 4, 3, 2, 1, 3, 4, 5, 4, 3, 2, 3, 4, 5, 4, 3, 4, 7, 12, 15, 16, 17, 23, 24, 25, 26, 25, 27, 32, 33, 34, 35, 36, 37, 38, 37, 38, 39, 40, 41, 42, 44, 48, 49, 50, 51, 52, 56, 61, 60, 59, 58, 57, 60, 61, 62, 63, 62, 61, 64, 65, 64, 63, 62, 63, 64, 65, 66, 65, 66, 65, 66, 67, 66, 67, 68, 69, 70, 71, 72, 73, 72, 71, 72, 73, 76, 77, 76, 75, 76, 77, 78, 83, 82], [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 4, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 21, 22, 23, 24, 25, 26, 25, 24, 23, 24, 25, 26, 27, 29, 30, 31, 32, 34, 38, 37, 36, 35, 34, 35, 36, 37, 36, 35, 34, 33, 32, 31, 32, 36, 40, 41, 42, 41, 40, 41, 42, 43, 49, 50, 49, 48, 49, 48, 49, 48, 49, 50, 49, 50, 49, 48, 49, 50, 49, 50, 49, 50, 53, 54, 55, 56, 57, 56, 57, 58, 63, 62, 63, 64, 65]]

6.11 Visualize all walks

all_walks is a list of lists: every sub-list represents a single random walk. If you convert this list of lists to a NumPy array, you can start making interesting plots!

# initialize and populate all_walks
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)

# Convert all_walks to NumPy array: np_aw
np_aw = np.array(all_walks)

# Plot np_aw and show
plt.plot(np_aw)
plt.show()

# Clear the figure
plt.clf()

# Transpose np_aw: np_aw_t
np_aw_t = np.transpose(np_aw)

# Plot np_aw_t and show
plt.plot(np_aw_t)
plt.show()

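Good job! You can clearly see how the different simulations of the random walk went. Transposing the 2D NumPy array was crucial: plt.plot() draws one line per column, so without the transpose you get 101 lines of 10 points each instead of 10 walks of 101 steps each.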
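The shapes involved can be checked without plotting. The sketch below uses an array of zeros as a stand-in for the simulated walks (an assumption made for illustration): each row is one walk before the transpose, and each column is one walk after it.

```python
import numpy as np

# Stand-in data: 10 walks of 101 positions each
np_aw = np.zeros((10, 101))
np_aw_t = np.transpose(np_aw)

print(np_aw.shape)    # (10, 101): rows are walks, columns are time steps
print(np_aw_t.shape)  # (101, 10): rows are time steps, columns are walks
```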

6.12 Implement clumsiness

There’s still something we forgot! You’re a bit clumsy and you have a 0.5% chance of falling down. That calls for another random number generation. Basically, you can generate a random float between 0 and 1. If this value is less than or equal to 0.005, you should reset step to 0.

# Simulate random walk 250 times
all_walks = []
for i in range(250) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)

        # Implement clumsiness
        if np.random.rand() <= 0.005:
            step = 0

        random_walk.append(step)
    all_walks.append(random_walk)

# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
plt.plot(np_aw_t)
plt.show()

Superb! Look at the plot. In some of the simulations you’re indeed taking a deep dive down!
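To see that the comparison with 0.005 really corresponds to roughly a 0.5% chance, you can draw many random floats and measure how often the condition holds. This sketch assumes that rand() accepts a size argument to generate an array of floats in one call.

```python
import numpy as np

np.random.seed(123)

# One million random floats between 0 and 1
draws = np.random.rand(1_000_000)

# Fraction of draws that would trigger a fall; it should be close to 0.005
fall_fraction = np.mean(draws <= 0.005)
print(fall_fraction)
```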

6.13 Plot the distribution

All these fancy visualizations have put us on a sidetrack. We still have to solve the million-dollar problem: What are the odds that you’ll reach 60 steps high on the Empire State Building?

Basically, you want to know about the end points of all the random walks you’ve simulated. These end points have a certain distribution that you can visualize with a histogram.

# Simulate random walk 500 times
all_walks = []
for i in range(500) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.005 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

# Create np_aw_t
np_aw_t = np.transpose(np.array(all_walks))

# Select last row from np_aw_t: ends
ends = np_aw_t[-1,:]

# Plot histogram of ends, display plot
plt.hist(ends)
plt.show()

Great job! Have a look at the histogram; what do you think your chances are?
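Rather than reading the chance off the histogram, you can estimate it as the fraction of simulated walks that end at step 60 or higher. The following is a sketch under stated assumptions: it keeps only the end point of each walk, uses the 0.5% clumsiness chance from section 6.12, and picks 123 as an arbitrary seed. The result is an estimate that changes with the seed and the number of simulations.

```python
import numpy as np

np.random.seed(123)  # assumption: any integer seed works

ends = []
for i in range(500):
    step = 0
    for x in range(100):
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        # Clumsiness: 0.5% chance of falling back to step 0
        if np.random.rand() <= 0.005:
            step = 0
    # Keep only the end point of this walk
    ends.append(step)

ends = np.array(ends)

# Fraction of walks that end at step 60 or higher
chance = np.mean(ends >= 60)
print(chance)
```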

7 Final Words

Congratulations on completing the course! More courses, tracks and instructions can be found here. Happy learning!